arXiv: 1504.05143vl [cs.NE] 20 Apr 2015 


Network Plasticity as Bayesian Inference 


David Kappel^’^, Stefan Habenschuss^’^, Robert Legenstein^ , Wolfgang Maass^ 


^Institnte for Theoretical Compnter Science, Graz University of Technology, A-8010 Graz, Anstria 

^These anthors contribnted eqnally to this work. 

Gorresponding anthor: kappel@igi.tngraz.at 
http://www.igi.tugraz.at/kappel 

April 21, 2015 


Abstract 

General results from statistical learning theory suggest to understand not only brain computations, but 
also brain plasticity as probabilistic inference. But a model for that has been missing. We propose that 
inherently stochastic features of synaptic plasticity and spine motility enable cortical networks of neurons to 
carry out probabilistic inference by sampling from a posterior distribution of network configurations. This 
model provides a viable alternative to existing models that propose convergence of parameters to maximum 
likelihood values. It explains how priors on weight distributions and connection probabilities can be merged 
optimally with learned experience, how cortical networks can generalize learned information so well to novel 
experiences, and how they can compensate continuously for unforeseen disturbances of the network. The 
resulting new theory of network plasticity explains from a functional perspective a number of experimental 
data on stochastic aspects of synaptic plasticity that previously appeared to be quite puzzling. 


1 Introduction 

We reexamine in this article the conceptual and mathematical framework for understanding the organization 
of plasticity in networks of neurons in the brain. We will focus on synaptic plasticity and network rewiring 
(spine motility) in this article, but our framework is also applicable to other network plasticity processes. One 
commonly assumes, that plasticity moves network parameters 6 (such as synaptic connections between neurons 
and synaptic weights) to values 6* that are optimal for the current computational function of the network. In 
learning theory, this view is made precise for example as maximum likelihood learning, where model parameters 
6 are moved to values 6* that maximize the fit of the resulting internal model to the inputs x that impinge 
on the network from its environment (by maximizing the likelihood of these inputs x). The convergence to 6* 
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is often assumed to be facilitated by some external regulation of learning rates, that reduces the learning rate 
when the network approaches an optimal solution. 

This view of network plasticity has been challenged on several grounds. From the theoretical perspective it 
is problematic because in the absence of an intelligent external controller it is likely to lead to overfitting of the 
internal model to the inputs x it has received, thereby reducing its capability to generalize learned knowledge 
to new inputs. Furthermore, networks of neurons in the brain are apparently exposed to a multitude of internal 
and external changes and perturbations, to which they have to respond quickly in order to maintain stable 
functionality. In addition, experimental data suggest that these networks are simultaneously able to maintain 
structural constraints and rules such as the empirically found connection probability between specific types of 
neurons, and heavy-tailed distributions of synaptic weights [T][^, which are in conflict with maximum likelihood 
learning models. Other experimental data point to surprising ongoing fluctuations in dendritic spines and spine 
volumes, to some extent even in the adult brain and in the absence of synaptic activity [^. Also a signihcant 
portion of axonal side branches and axonal boutons were found to appear and disapper within a week in adult 
visual cortex, even in the absence of imposed learning and lesions [^. Furthermore surprising random drifts 
of tuning curves of neurons in motor cortex were observed [^. Apart from such continuously ongoing changes 
in synaptic connections and tuning curves of neurons, massive changes in synaptic connectivity were found to 
accompany functional reorganization of primary visual cortex after lesions, see e.g. [^. 

We therefore propose to view network plasticity as a process that continuously moves high-dimensional 
network parameters 6 within some low-dimensional manifold that represents a compromise between overriding 
structural rules and different ways of fitting the internal model to external inputs x. We propose that ongoing 
stochastic fluctuations (not unlike Brownian motion) continuously drive network parameters 9 within such 
low-dimensional manifold. The primary conceptual innovation is the departure from the traditional view of 
learning as moving parameters to values 6* that represent optimal (or locally optimal) fits to network inputs 
X. We show that our alternative view can be turned into a precise learning model within the framework of 
probability theory. This new model satishes theoretical requirements for handling priors such as structural 
constraints and rules in a principled manner, that have previously already been formulated and explored in the 
context of artificial neural networks [sl^, as well as more recent challenges that arise from probabilistic brain 


models 10 . The low-dimensional manifold of parameters 9 that becomes the new learning goal in our model can 
be characterized mathematically as the high probability regions of the posterior distribution p*{9\x) of network 
parameters 9. This posterior arises as product of a general prior psi9) for network parameters (that enforces 
structural rules) with a term that describes the quality of the current internal model (e.g. in a predictive 
coding or generative modeling framework: the likelihood pj^{x\9) of inputs x for the current parameter values 
9 of the network M). More precisely, we propose that brain plasticity mechanisms are designed to enable 
brain networks to sample from this posterior distribution p*{9\x) through inherent stochastic features of their 
molecular implementation. In this way synaptic and other plasticity processes are able to carry out probabilistic 
(or Bayesian) inference through sampling from a posterior distribution that takes into account both structural 
rules and fitting to external inputs. Hence this model provides a solution to the challenge of 10 to understand 
how posterior distributions of weights can be represented and learned by networks of neurons in the brain. 

This new model proposes to reexamine rules for synaptic plasticity. Rather than viewing trial-to-trial 
variability and ongoing fluctuations of synaptic parameters as the result of a suboptimal implementation of an 
inherently deterministic plasticity process, it proposes to model experimental data on synaptic plasticity by rules 
that consist of three terms: the standard (typically deterministic) activity-dependent (e.g., Hebbian or STDP) 
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term that fits the model to external inputs, a second term that enforces structural rules (priors), and a third 
term that provides the stochastic driving force. This stochastic force enables network parameters to sample 
from the posterior, i.e., to fluctuate between different possible solutions of the learning task. The stochastic 
third term can be modeled by a standard formalism (stochastic Wiener process) that had been developed to 
model Brownian motion. The first two terms can be modeled as drift terms in a stochastic process. A key 
insight is that one can easily relate details of the resulting more complex rules for the dynamics of network 
parameters 0, which now become stochastic differential equations, to specific features of the resulting posterior 
distribution p*(0|x) of parameter vectors 6 from which the network samples. Thereby, this theory provides 
a new framework for relating experimentally observed details of local plasticity mechanisms (including their 
typically stochastic implementation on the molecular scale) to functional consequences of network learning. 
For example, one gets a theoretically founded framework for relating experimental data on spine motility 
to experimentally observed network properties, such as sparse connectivity, specific distributions of synaptic 
weights, and the capability to compensate against perturbations [IT] . 

We demonstrate the resulting new style of modeling network plasticity in three examples. These examples 
demonstrate how previously mentioned functional demands on network plasticity, such as incorporation of 
structural rules, automatic avoidance of overfitting, and inherent and immediate compensation for network 
perturbances, can be accomplished through stochastic local plasticity processes. We focus here on common 
models for unsupervised learning in networks of neurons: generative models. We hrst develop the general 
learning theory for this class of models, and then describe applications to common non-spiking and spiking 


generative network models. Both structural plasticity (see 12,13 for reviews) and synaptic plasticity (STDP) 
are integrated into the resulting theory of network plasticity. 


2 Results 


We present a new theoretical framework for analyzing and understanding local plasticity mechanisms of net¬ 
works of neurons in the brain as stochastic processes, that generate specihc distributions p{0) of network 
parameters 0 over which these parameters fluctuate. This framework can be used to analyze and model many 
types of learning processes. We illustrate it here for the case of unsupervised learning, i.e., learning without 
a teacher or rewards. Obviously many learning processes in biological organisms are of this nature, especially 
learning processes in early sensory areas, and in other brain areas that have to provide and maintain on their 
own an adequate level of functionality, even in the face of internal or external perturbations. 

A common framework for modeling unsupervised learning in networks of neurons are generative models, 
which date back to the 19’th century, when Helmholtz proposed that perception could be understood as 


unconscious inference 14 . Since then the hypothesis of the “generative brain” has been receiving considerable 


attention, fueling interest in various aspects of the relation between Bayesian inference and the brain |l0|l5|l6j . 
The basic assumption of the “Bayesian brain” theory is that the activity 2 : of neuronal networks in the brain can 
be viewed as an internal model for hidden variables in the outside world that give rise to sensory experiences 
X (such as the response x of auditory sensory neurons to spoken words that are guessed by an internal model 
z). The internal model 2 : is usually assumed to be represented by the activity of neurons in the network, 
e.g., in terms of the firing rates of neurons, or in terms of spatio-temporal spike patterns. A network Af 
of stochastically firing neuron is modeled in this framework by a probability distribution pj\f{x,z\6) that 
describes the probabilistic relationships between N input patterns x = ... ,x^ and corresponding network 
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responses z = z^,, where 9 denotes the vector of network parameters that shape this distribution, e.g., 
via synaptic efficacies and network connectivity. The marginal probability pj\f{x\9) = Y^^pj\f{x,z\9) of the 
actually occurring inputs x = x^,... ,x^ under the resulting internal model of the neural network Af with 
parameters 6 can then be viewed as a measure for the agreement between this internal model (which carries 
out “predictive coding” [17] ) and its environment (which generates the inputs x). 

The goal of network learning is usually described in this probabilistic generative framework as finding 
parameter values 6* that maximize this agreement, or equivalently the likelihood of the inputs x (maximum 
likelihood learning): 

6 * = argmaxp;vr(x|0). (1) 

0 

Locally optimal parameter solutions are usually determined by gradient ascent on the data likelihood pj^{x.\9). 


2.1 Learning a posterior distribution through stochastic synaptic plasticity 

In contrast, we assume here that not only a neural network AA, but also a prior ps{9) for its parameters are 
given. This prior ps can encode both structural constraints (such as sparse connectivity) and structural rules 
(e.g., a heavy-tailed distribution of synaptic weights). Then the goal of network learning becomes: 


learn the posterior distribution p*{6\x) defined (up to normalization) by ps{9) ■ pj^{x\9) 


( 2 ) 


The patterns x = ..., x^ are assumed here to be regularly reoccurring network inputs. 

A key insight (see Fig. for an illustration) is that stochastic local plasticity rules for the parameters 6i 
enable a network to achieve the learning goal ([^ : The distribution of network parameters 6 will converge after 
a while to the posterior distribution p*{9) = p*{6\x) - and produce samples from it - if each network parameter 
9i obeys the dynamics 


d6i = b 


— logps{9) + — logpj^{x\e) ] dt + V^dWi , 


(3) 


where the learning rate b > 0 controls the speed of the parameter dynamics. Eq. ([^ is a stochastic differential 
equation (see [^), which differs from commonly considered differential equations through the stochastic term 
dWi that describes infinitesimal stochastic increments and decrements of a Wiener process Wj. A Wiener 
process is a standard model for Brownian motion in one dimension (more precisely: the limit of a random 
walk with infinitesimal step size and normally distributed increments Wf — Wf ~ NoRMAL(0,t — s) between 
times t and s). Thus in an approximation of Q for discrete time steps At the term dWi can be replaced by 
Gaussian noise with variance At (see Eq. 0). Note that Eq. (j^ does not have a single solution ^^(t), but a 
continuum of stochastic sample paths (see Eig. for an example) that each describe one possible time course 
of the parameter 


Rigorous mathematical results based on Eokker-Planck equations (see Methods and Supplement for details) 
allow us to infer from the stochastic local dynamics of the parameters 6i given by a stochastic differential 
equation of the form 0 the probability that the parameter vector 6 can be found after a while in a particular 
region of the high-dimensional space in which it moves. The key result is that for the case of the stochastic 
dynamics according to Eq. 0 this probability is equal to the posterior p*{6\x) given by Eq. 0. Hence the 
stochastic dynamics 0 of network parameters 0* enables a network to achieve the learning goal 0: to learn 
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the posterior distribution p*(0|x). This posterior distribution is not represented in the network through any 
explicit neural code, but through its stochastic dynamics, as the unique stationary distribution of a Markov 
process from which it samples continuously. In particular, if most of the mass of this posterior distribution 
is concentrated on some low-dimensional manifold, the network parameters 6 will move most of the time 
within this low-dimensional manifold. Since this realization of the posterior distribution p*{6\x.) is achieved by 
sampling from it, we refer to this model Q (in the case where the parameters 9i represent synaptic parameters) 
as synaptic sampling. 

The stochastic term dWj in Eq. Q provides a simple integrative model for a multitude of biological and 
biochemical stochastic processes that effect the efficacy of a synaptic connection. The mammalian postsynaptic 
density comprises over 1000 different types of proteins [^. Many of those proteins that effect the amplitude 
of postsynaptic potentials and synaptic plasticity, for example NMDA receptors, occur in small numbers, and 
are subject to Brownian motion within the membrane 20 . In addition, the turnover of important scaffolding 
proteins in the postsynaptic density such as PSD-95, which clusters glutamate receptors and is thought to have 
a substantial impact on synaptic efficacy, is relatively fast, on the time-scale of hours to days, depending on 
developmental state and environmental condition |21|. Also the volume of spines at dendrites, which is assumed 


to be directly related to synaptic efficacy 22,23 is reported to ffuctuate continuously, even in the absence of 
synaptic activity . Furthermore the stochastically varying internal states of multiple interacting biochemical 
signaling pathways in the postsynaptic neuron are likely to effect synaptic transmission and plasticity [24] . 

The contribution of the stochastic term dWi in ^ can be scaled by a temperature parameter \/r, where T 
can be any positive number. The resulting stationary distribution of 6 is proportional to p* (6) t ^ so that the 
dynamics of the stochastic process can be described by the energy landscape ^ . For high values of T this 

energy landscape is flattened, i.e., the main modes olp*{0) become less pronounced. For T —)• 0 the dynamics 
of 9 approaches a deterministic process and converges to the next local maximum of p*{9). Thus the learning 
process approximates for low values of T maximum a posteriori (MAP) inference [^. We propose that this 
temperature parameter T is regulated in biological networks of neurons in dependence of the developmental 
state, environment, and behavior of an organism. One can also accommodate a modulation of the dynamics of 
each individual parameter 9i by a learning rate h{9i) that depends on its current value (see Methods). 


2.2 Online synaptic sampling 

For online learning one assumes that the likelihood p^{ii\0) = p^{x^, ..., x^\6) of the network inputs can be 
factorized: 

N 

PM{x^,...,x^\e) = W^p^{x^\e), (4) 

n=l 

i.e., each network input a?” can be explained as being drawn individually from p_y(®"'|0), independently from 
other inputs. 

The weight update rule Q depends on all inputs x = a;^,... ,x^, hence synapses have to keep track of 
the whole set of all network inputs for the exact dynamics (batch learning). In an online scenario, we assume 
that only the current network input x'^ is available for synaptic sampling. One then arrives at the following 
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Figure 1: Maximum likelihood (ML) learning vs. synaptic sampling. (A,B,C) Illustration of ML 
learning for two parameters 6 = {0i, 02} of a neural network A/". (A) 3D plot of an example likelihood function. 
For a fixed set of inputs x it assigns a probability density (amplitude on z-axis) to each parameter setting 
0. (B) This likelihood function is dehned by some underlying neural network N. (C) Multiple trajectories 
along the gradient of the likelihood function in (A). The parameters are initialized at random initial values 
(black dots) and then follow the gradient to a local maximum (red triangles). (D) Example for a prior that 
prefers small values for d. (E) The posterior that results as product of the prior (D) and the likelihood (A). 

(F) A single trajectory of synaptic sampling from the posterior (E), starting at the black dot. The parameter 
vector 6 fluctuates between different solutions, the visited values cluster near local optima (red triangles). 

(G) Cartoon illustrating the dynamic forces (plasticity rule Q) that enable the network to sample from the 
posterior distribution p*{0\'x) in (E). Magnihcation of one synaptic sampling step dO of the trajectory in (F) 
(green). The three forces acting on 6: the deterministic drift term (red) is directed to the next local maximum 
(red triangle), it consists of the hrst two terms in Eq. Q; the stochastic diffusion term dW (black) has a 
random direction. See [Supplement] for hgure details. 
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online-approximation to i) 


dOi = b 


\ogps{9) + ^ ^ logPA^(®"|0) ] dt + dWi . 


do 


(5) 


Note the additional factor N in the rule. It compensates for the A^-fold summation of the first and last term 
in ([^ when one moves through all N inputs x'^. Although convergence to the correct posterior distribution 


cannot be guaranteed theoretically for this online rule, we show in Methods that the rule is a reasonable 
approximation to the batch-rule Q. Furthermore, all subsequent simulations are based on this online rule, 
which demonstrates the viability of this approximation. 


2.3 Relationship to maximum likelihood learning 

Typically, synaptic plasticity in generative network models is modeled as maximum likelihood learning. Time 
is often discretized into small discrete time steps At. For gradient-based approaches the parameter change 
is then given by the gradient of the log likelihood multiplied with some learning rate rj: 

= ry A \ogpj^(x^\0) . (6) 


To compare this maximum likelihood update with synaptic sampling, we consider a version of the parameter 
dynamics (§ for discrete time (see |Methods| for a derivation): 


o o 

= ry ( — logps{0) + ^ ^ \ogpj^{x^\e) 


+ 


(7) 


where the learning rate ry is given by ry = 6At and zz* denotes Gaussian noise with zero mean and variance 
1, drawn independently for each parameter 9i and each update time t. We see that the maximum likelihood 
update Q becomes one term in this online version of synaptic sampling. Equation Q is a special case of the 
online Langevin sampler that was introduced in |25| . 

The first term ^ logps{6) in Q arises from the prior psi9), and has apparently not been considered in 
previous rules for synaptic plasticity. An additional novel component is the Gaussian noise term (see also 
Fig. [^). It arises because the accumulated impact of the Wiener process Wj over a time interval of length 
At is distributed according to a normal distribution with variance At. In contrast to traditional maximum 
likelihood optimization based on additive noise for escaping local optima, this noise term is not scaled down 
when learning approaches a local optimum. This ongoing noise is essential for enabling the network to sample 
from the posterior distribution p*{9) via continuously ongoing synaptic plasticity (see Fig. and Fig. [^). 


2.4 Synaptic sampling improves the generalization capability of a neural network 

The previously described theory for learning a posterior distribution over parameters 6 can be applied to all 
neural network models M where the derivative ^ logpj\f{x'^\6) in ([^ can be efficiently estimated. Since this 
term also has to be estimated for maximum likelihood learning ([^, synaptic sampling can basically be applied 
to all neuron and network models that are amenable to maximum likelihood learning. We illustrate salient 
new features that result from synaptic sampling (i.e., plasticity rules ([^ or Q) for some of these models. We 
begin with the Boltzmann machine [^, one of the oldest generative neural network models. It is currently still 
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Figure 2: Priors for synaptic weights improve generalization capability. (A) The training set, consist¬ 
ing of five samples of a handwritten 1. Below a cartoon illustrating the network architecture of the restricted 
Boltzmann machine (RBM), composed of a layer of 784 visible neurons x and a layer of 9 hidden neurons z. 

(B) Examples from the test set. It contains many different styles of writing that are not part of the training set. 

(C) Evolution of 50 randomly selected synaptic weights throughout learning (on the training set). The weight 
histogram (right) shows the distribution of synaptic weights at the end of learning. 80 histogram bins were 
equally spaced between -4 and 4. (D) Performance of the network in terms of log likelihood on the training 
set (blue) and on the test set (red) throughout learning. Mean values over 100 trial runs are shown, shaded 
area indicates std. The performance on the test set initially increases but degrades for prolonged learning. 
(E) Evolution of 50 weights for the same network but with a bimodal prior. The prior ps{w) is indicated 
by the blue curve. Most synaptic weights settle in the mode around 0, but a few larger weights also emerge 
and stabilize in the larger mode. Weight histogram (green) as in (C). (F) The log likelihood on the test set 
maintains a constant high value throughout the whole learning session, compare to (D). 
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extensively investigated in the context of deep learning 27,^. We demonstrate in Fig. ,F the improved 
generalization capability of this model for the learning approach Q (learning of the posterior), compared with 
maximum likelihood learning (approach Q), which had been theoretically predicted by and [^. But this 
model for learning the posterior (approach ([^) in Boltzmann machines is now based on local plasticity rules. 
Note that the Boltzmann machine with synaptic sampling samples simultaneously on two different time scales: 
In addition to sampling for given parameters 6 from likely network states in the usual manner, it now samples 
simultaneously on a slower time scale according to ([^ from the posterior of network parameters 0. 

A Boltzmann machine employs extremely simple non-spiking neuron models with binary outputs. Neuron 
yi outputs 1 with probability CT^Y^jWijVj + h): else 0, where a is the logistic sigmoid ij{u) = , with 

synaptic weights Wij and bias parameters Synaptic connections in a Boltzmann machine are bidirectional, 
with symmetric weights {wij = Wji). The parameters 6 for the Boltzmann machine consist of all weights 
Wij and biases bi in the network. For the special case of a restricted Boltzmann machine (RBM), maximum 


likelihood learning of these parameters can be done efficiently [29] , and therefore RBM’s are typically used for 
deep learning. An RBM has a layered structure with one layer of visible neurons x and a second layer of hidden 
neurons 2 . Synaptic connections are formed only between neurons on different layers (Fig.[^). The maximum 
likelihood gradients logpj\f{x\6) and logpj\f{x\0) can be efficiently approximated 

for this model, for example 


d 


dw. 


iogpj^{x^\e) 


V 




( 8 ) 


where is the output of input neuron j while input x"' is presented, and x^ its output during a subsequent 
phase of spontaneous activity (“reconstruction phase”); analogously for the hidden neuron Zj (see Methods 


and Supplement). 


We integrated this maximum likelihood estimate Q into the synaptic sampling rule Q in order to test 
whether a suitable prior P 5 (w) for the weights improves the generalization capability of the network. The 
network received as input just five samples x^,...,x^ of a handwritten Arabic number 1 from the MNIST 
dataset (the training set, shown in Fig. [^) that were repeatedly presented. Each pixel of the digit images was 
represented by one neuron in the visible layer (which consisted of 784 neurons). We selected a second set of 
100 samples of the handwritten digit 1 from the MNIST dataset as test set (Fig. §3). These samples include 
completely different styles of writing that were not present in the training set. After allowing the network to 
learn the five input samples from Fig. for various numbers of update steps (horizontal axis of Fig. [^,F), we 
evaluated the learned internal model of this network M for the digit 1 by measuring the average log-likelihood 
logpj\f{'x\6) for the test data x. The result is indicated in Fig. IP ,F for the training samples x by the blue 
curves, and for the new test examples x, that were never shown while synaptic plasticity was active, by the 
red curves. 

First, a uniform prior over the synaptic weights was used (Fig. [^), which corresponds to the common 
maximum likelihood learning paradigm Q. The performance on the test set (shown on vertical axis) initially 
increases but degrades for prolonged exposure to the training set (length of that prior exposure shown on 
horizontal axis). This effect is known as overfitting [^|^. It can be reduced by choosing a suitable prior ps{0) 
in the synaptic sampling rule Q. The choice for the prior distribution is best if it matches the statistics of the 
training samples [^, which has in this case two modes (resulting from black and white pixels). The presence 
of this prior in the learning rule maintains good generalization capability for test samples even after prolonged 
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exposure to the training set (red curve in Fig. W)- 
2.5 Spine motility as synaptic sampling 


In the following sections we apply our synaptic sampling framework to networks of spiking neurons and biological 
models for network plasticity. The number and volume of spines for a synaptic connection is thought to be 
directly related to its synaptic weight [30] . Experimental studies have provided a wealth of information about 
the stochastic dynamics of dendritic spines (see e.g. [3l |30[|3^ ). They demonstrate that the volume of a 
substantial fraction of dendritic spines varies continuously over time, and that all the time new spines and 
synaptic connections are formed and existing ones are eliminated. We show that these experimental data on 
spine motility can be understood as special cases of synaptic sampling. The synaptic sampling framework is 
however very general, and many different models for spine motility can be derived from it as special cases. We 
demonstrate this here for one simple model, induced by the following assumptions: 


1. We restrict ourselves to plasticity of excitatory synapses, although the framework is general enough to 
apply to inhibitory synapses as well. 


2. In accordance with experimental studies 30 , we require that spine sizes have a multiplicative dynamics, 
i.e., that the amount of change within some given time window is proportional to the current size of the 
spine. 

3. We assume here for simplicity that a synaptic connection between two neurons is realized by a single 
spine and that there is a single parameter 9i for each potential synaptic connection i. 


The last requirement can be met by encoding the state of the synapse in an abstract form, that represents 
synaptic connectivity and synaptic plasticity in a single parameter 0j. We define that negative values of 6i 
represent a current disconnection and positive values represent a functional synaptic connection. The distance 
of the current value of 9i from zero indicates how likely it is that the synapse will soon reconnect (for negative 
values) or withdraw (for positive values), see Fig. Hh- In addition the synaptic parameter 9i encodes for positive 
values the synaptic efficacy Wi, i.e., the resulting EPSP amplitudes, by a simple mapping Wi = f{9i). 

A large class of mapping functions / is supported by our theory (see Supplement for details). The second 
assumption which requires multiplicative synaptic dynamics supports an exponential function / in our model, 
in accordance with previous models of spine motility [^. Thus, we assume in the following that the efficacy 
Wi of synapse i is given by 


Wi = exp(6»i - 6»o) , (9) 

see Fig. Note that for a large enough offset 9q, negative parameter values 9i (which model a non-functional 
synaptic connection) are automatically mapped onto a tiny region close to zero in the rc-space, so that re¬ 
tracted spines have essentially zero synaptic efficacy. The general rule for online synaptic sampling ([^ for the 
exponential mapping Q can be written as (see Supplement) 

d9i = b(-^ logps{0) -h Nwi logpx{x'^\w)\dt + V^dWi . (10) 


Supplement 
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Figure 3: Integration of spine motility into the synaptic sampling model. (A) Illustration of the 
parametrization of spine motility. Values 0 > 0 indicate a functional synaptic connection. (B) A Gaussian 


prior ps{0), and a few stochastic sample trajectories of 9 according to the synaptic sampling rule (10). Negative 
values of 6 (gray area) are interpreted as non-functional connections. Some stable synaptic connections emerge 
(traces in the upper half), whereas other synaptic connections come and go (traces in lower half). All traces, as 
well as survival statistics shown in (E,F), are taken from the network simulation described in detail in the next 
section and Fig. (C) The exponential function maps synapse parameters 9 to synaptic efficacies w. Negative 
values of 9, corresponding to (retracted) spines are mapped to a tiny region close to zero in the tc-space. (D) 
The Gaussian prior in the 0-space translates to a log-normal distribution in the rc-space. The traces from (B) 
are shown in the right panel transformed into the w-space. Only persistent synaptic connections contribute 
substantial synaptic efficacies. (E,F) The emergent survival statistics of newly formed synaptic connections, 
(i.e., formed during the preceding 12 hours) exhibits in our synaptic sampling model a power-law behavior 
(gray area denotes simulation results, red curves are power-law hts, see Supplement). The time-scale (and 


exponent of the power-law) depends on the learning rate b in equation (10), and can assume any value in our 
quite general model (shown is 6 = 10“^ in (E) and b = 10“® in (F)). 
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In equation (10) the multiplicative synaptic dynamics becomes explicit. The gradient -i^\ogpj\f{x'^\w)^ i.e., 
the activity-dependent contribution to synaptic plasticity, is weighted by Wi. Hence, for negative values of 
9i (non-functional synaptic connection), the activities of the pre- and post-synaptic neurons have negligible 
impact on the dynamics of the synapse. Assuming a large enough 9q, retracted synapses therefore evolve solely 
according to the prior ps{0) and the random fluctuations dW*. For large values of 9i the opposite is the case. 
The influence of the prior ^ logps{0) and the Wiener process dWi become negligible, and the dynamics is 
dominated by the activity-dependent likelihood term. Large synapses can therefore become quite stable if the 
presynaptic activity is strong and reliable (see Fig. [^). Through the use of parameters 6 which determine 
both synaptic connectivity and synaptic efficacies, the synaptic sampling framework provides a unified model 
for structural and synaptic plasticity. The prior distribution can have significant impact on the spine motility, 
encouraging for example sparser or denser synaptic connectivity. If the activity-dependent second term in 
Eq. (|I0[), that tries to maximize the likelihood, is small (e.g., because 9i is small or parameters are near a mode 


of the likelihood) then Eq. (10) implements an Ornstein Uhlenbeck process. This prediction of our model is 
consistent with a previous analysis which showed that an Ornstein Uhlenbeck process is a viable model for 
synaptic spine motility pOl. 


The weight dynamics that emerges through the stochastic process (10) is illustrated in the right panel of 


Fig.[§). A Gaussian parameter prior ps{9i) results in a log-normal prior ps{wi) in a corresponding stochastic 
differential equation for synaptic efficacies Wi 


see 


Supplement for details). 


The last term (noise term) in our synaptic sampling rule (10) predicts that eliminated connections spon¬ 


taneously regrow at irregular intervals. In this way they can test whether they can contribute to explaining 
the input. If they cannot contribute, they disappear again. The resulting power-law behavior of the sur¬ 
vival of newly formed synaptic connections (Fig. ^E,F) matches corresponding new experimental data 34 


and is qualitatively similar to earlier experimental results which revealed a quick decay of transient dendritic 
spines [^,33,35 . Functional consequences of this structural plasticity are explored in Fig. and 


2.6 Fast adaptation of synaptic connections and weights to a changing inpnt statistics 


We will explore in this and the next section implications of the synaptic sampling rule (10) for network plasticity 
in simple generative spike-based neural network models. 


The main types of spike-based generative neural network models that have been proposed are [37f[40] . We 
focus here on the type of models introduced by [^|^|^, since these models allow an easy estimation of the 
likelihood gradient (the second term in (|10[)) and can relate this likelihood term to STDP. Since these spike- 
based neural network models have non-symmetric synaptic connections (that model chemical synapses between 
pyramidal cells in the cortex), they do not allow to regenerate inputs x from internal responses z by running 
the network backwards (like in a Boltzmann machine). Rather they are implicit generative models, where 
synaptic weights from inputs to hidden neurons are interpreted as implicit models for presynaptic activity, 
given that the postsynaptic neuron fires. 

We focus in this section on a simple model for an ubiquitous cortical microcircuit motif: an ensemble 
of pyramidal cells with lateral inhibition, often referred to as Winner-Take-All (WTA) circuit. It has been 
proposed that this microcircuit motif provides for computational analysis an important bridge between single 


neurons and larger brain systems . We employ a simple form of divisive normalization (as proposed by 42 


see Methods) to model lateral inhibition, thereby arriving at a theoretically tractable version of this microcircuit 
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Figure 4: Adaptation of synaptic connections to changing input statistics through synaptic sam¬ 
pling. (A) Illustration of the network architecture. A WTA circuit consisting of ten neurons 2 ; receives afferent 
stimuli from input neurons x (few connections shown for a single neuron in z). (B) The STDP learning curve 
that arises from the likelihood term in equation ( |11[ ). (C) Measured STDP curve that results from a related 
STDP rule for a moderate pairing frequency of 20 Hz, as in [^. (Figure adapted from [^). (D) Two different 
data sets, both consisting of hundreds of samples of handwritten digits were presented to the network. In 
phase 1 and 3 only samples for digit 1 were shown, in phase 2 samples from digits 1 and 2. (E,F) Evolution of 
synaptic parameters for four of the ten neurons in z is shown. A snapshot of the current weight vectors (E), and 
currently active synaptic connections (green dots in (F)) is taken every 1000 seconds (i.e., after showing 4000 
further input samples). It can be seen that most of the initial synapses are eliminated: only the connections 
to input neurons that encode relevant information (i.e., that are not encoding white background pixels most 
of the time) survive. Upon the switch to a different distribution, some eliminated synapses are reactivated, 
while others which were previously active, are eliminated. Note the constant fluctuation of functional synapses 
(sparse green dots in the background of (F)). (G) Time course of a sample synaptic parameter Oi is shown 
(position of the synapse is indicated by red squares in (F)). The parameter prior is shown on the left. This 
synaptic connection turns out to be useful for learning the digit 2, whereas its dynamics during the first and 
third phase is driven by the prior and random walk behavior (Wiener process dW). Consequently it probes 
crossing of the threshold 0 at irregular intervals during phase 1 and 3. (H) The total number of active synapses 
remains low throughout learning (effect of the prior in the learning rule). 


motif that allows us to compute the maximum likelihood term (second term in (|10[)) in the synaptic sampling 
rule. We assumed Gaussian prior distributions ps{0i) over the synaptic parameters 6i (as in Fig. [^). Then 


the synaptic sampling rule (10) yields for this model 


dOi = 2 (^i — p) + Nwi S{t) {xi{t) — a '^dt + V^dWi , 


( 11 ) 


where S{t) denotes the spike train of the postsynaptic neuron and Xi{t) denotes the weight-normalized value 
of the sum of EPSPs from presynaptic neuron i at time t (i.e., the summed EPSPs that would arise for weight 


Wi = I] see Methods for details), a is a parameter that scales the impact of synaptic plasticity in dependence on 
the current synaptic efficacy. The resulting activity-dependent component S{t){xi{t) — ae"'*) of the likelihood 
term is a simplified version of the standard STDP learning rule (Fig. |^, C), like in [37l[43] . Synaptic plasticity 
(STDP) for connections from input neurons to pyramidal cells in the WTA circuit can be understood from the 
generative aspect as fitting a mixture of Poisson (or other exponential family) distributions to high-dimensional 


spike inputs 1^,40 . The factor Wi = exp(0j — 8q) had been discussed in because it is compatible with the 


underlying generative model, but provides in addition a better fit to the experimental data of 36 


We examine in this section emergent properties of network plasticity in this simple spike-based neural 
network under the synaptic sampling rule ©• The network was exposed to a large number of handwritten 
digits from the MNIST dataset, like in Fig. However we examined in addition network configurations that 
result from changes in the distribution in input patterns. The change of the input distribution was implemented 
by presenting to the network in phase 1 only examples of the handwritten digit 1, then presenting in phase 2 
examples of the handwritten digits 1 and 2, and finally switching in phase 3 back to examples of the digit 1 
only (Fig. Ep). Inputs were presented sequentially and in random order, such that each sample was represented 


by a 200ms-long spike pattern (see Methods for details). The input samples were randomly drawn from the 
set of all 6742 handwritten digits 1 in the MNIST database during phase 1 and 3. During phase 2 the inputs 
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were drawn from this set of handwritten I’s combined with all 5958 handwritten digits 2 in MNIST. The digit 
presentations were interleaved with 50ms frames of random IHz inpnt activity. These inputs were injected into 
a small WTA circuit consisting of ten spiking neurons (Fig. EK; see [Method^ for details). Synaptic sampling 
according to © was applied continuously to all synapses from the 784 input neurons (encoding 28x28 input 
pixels) to the ten neurons in the WTA circuit. 

The resulting network plasticity, comprising structural plasticity and synaptic plasticity, is presented in 
Fig.[^-G. The evolution of synaptic strengths is shown in Fig.[^ for four of the ten neurons (synaptic weights 
projected back into the 28x28 input space for easier visualization). Synapse elimination and regrowth can be 
observed in Fig. w ,G. The prior acts as a constant force which tries to drive synapses towards elimination. 
Those synapses which have a strong positive likelihood gradient (second term in Eq. indicating an 

important position in explaining the input, survive. The remaining synapses are eliminated as they cross the 
zero threshold, thereby entering the “potential regrowth” regime. The randomly distributed isolated green 
dots in Fig. EF show tentative regrowth of such synapses. Fig. EP tracks the history of a single synaptic 
connection of neuron 4 (see red square in the bottom row of Fig. [^). Whereas its regrowth is not supported 
by the likelihood term during phases 1 and 3, it becomes stable during phase 2. Regrowth, as opposed to 
elimination, is almost entirely driven by the prior and the noise term dWi in ( |11[ ). This asymmetry between 
active and inactive synapses follows directly from our approach to map negative parameter values onto zero 
synaptic efficacy: varying the parameter within the negative range has no effect on the likelihood term. 

In contrast to preceding models for spine motility and STDP we have integrated here both into a single 
stochastic rule © that was derived from first principle (goal Q and corresponding rule Q). Note that a 
substantial stochastic component is essential for network plasticity in order to enable a network to reconfigure 
in response to changes in its input distribution as in Fig. At the same time the prior ensures that the network 
connections remain sparse, even in the face of challenges through a changing input distribution (see Fig. Ht 


2.7 Inherent network compensation capability through synaptic sampling 

Numerous experimental data show that the same function of a neural circuit is achieved in different individuals 
with drastically different parameters, and also that a single organism can compensate for disturbances by 
moving to a new parameter vector ©I 44[]47[ . These results suggest that there exists some low-dimensional 
submanifold of values for the high-dimensional parameter vector 0 of a biological neural network that all 
provide stable network function (degeneracy). We propose that the previously discussed posterior distribution 
of network parameters 6 provides a mathematical model for such low-dimensional submanifold. Furthermore we 
propose that the underlying continuous stochastic fluctuation dW provides a driving force that automatically 
moves network parameters (with high probability) to a functionally more attractive regime when the current 
solution becomes less performant because of perturbations, such as lesions of neurons or network connections. 
This compensation capability is not an add-on to the synaptic sampling model, but an inherent feature of its 
organization. 

We demonstrate this inherent compensation capability in Fig. for a generative spiking neural network 
with synaptic parameters 0 that regulate simultaneously structural plasticity and synaptic plasticity (dynamics 
of weights) as in Fig. andThe prior ps{0) for these parameters is here the same as in the preceding section 
(see Fig. S3 on the left). But in contrast to the previous section we consider here a network that allows us 
to study the self-organization of connections between hidden neurons. The network consists of eight WTA- 
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Figure 5: Inherent compensation for network perturbations. (A) A spike-based generative neural 
network (illustrated at the bottom) received simultaneously spoken and handwritten representations of the 
same digit (and for tests only spoken digits, see (B)). Stimulus examples for spoken and written digit 2 are 
shown at the top. These inputs are presented to the network through corresponding firing rates of “auditory” 
(xa) and “visual” (xy) input neurons. Two populations "la and zy of 40 neurons, each consisting of four WTA 
circuits like in Fig. receive exclusively auditory or visual inputs. In addition, arbitrary lateral excitatory 
connections between these “hidden” neurons are allowed. (B) Assemblies of hidden neurons emerge that encode 
the presented digit (I or 2). Top panel shows PETH of all neurons from zy for stimulus 1 (left) and 2 (right) 
after learning, when only an auditory stimulus is presented. Neurons are sorted by the time of their highest 
average firing. Although only auditory stimuli are presented, it is possible to reconstruct an internally generated 
“guessed” visual stimulus that represents the same digit (bottom). (C) First three PCA components of the 
temporal evolution of a subset 9' of network parameters 6 (see text). Two major lesions were applied to the 
network. In the first lesion (transition to red) all neurons that significantly encode stimulus 2 were removed from 
the population zy. In the second lesion (transition to green) all currently existing synaptic connections between 
neuron in za and zy were removed, and not allowed to regrow. After each lesion the network parameters O' 
migrate to a new manifold. (D) The generative reconstruction performance of the “visual” neurons zy for 
the test case when only an auditory stimulus is presented was tracked throughout the whole learning session, 
including lesions 1 and 2 (bottom panel). After each lesion the performance strongly degrades, but reliably 
recovers. Insets show at the top the synaptic weights of neurons in zy at 4 time points ti,... ,t 4 , projected 
back into the input space like in Fig. Network diagrams in the middle show ongoing network rewiring for 
synaptic connections between the hidden neurons za and zy. Each arrow indicates a functional connection 
between two neurons. To keep the figure uncluttered only subsets of synapses are shown (1% randomly drawn 
from the total set of possible lateral connections). Connections at time t 2 that were already functional at time 
ti are plotted in gray. The neuron whose parameter vector 6' is tracked in (C) is highlighted in red. The text 
under the network diagrams shows the total number of functional connections between hidden neurons at the 
time point. 
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circuits, but in contrast to Fig. we allow here arbitrary excitatory synaptic connections between neurons 
within the same or different ones of these WTA circuits. This network models multi-modal sensory integration 
and association in a simplified manner. Two populations of “auditory” and “visual” input neurons and xy 
project onto corresponding populations and zy of hidden neurons (each consisting of one half of the WTA 
circuits, see lower panel of Fig. [^). Only a fraction of the potential synaptic connections became functional 
(see supplementary Fig. SlA) through the synaptic sampling rule (0 that integrates structural and synaptic 
plasticity. Synaptic weights and connections were not forced to be symmetric or bidirectional. 

As in the previous demonstrations we do not use external rewards or teacher-inputs for guiding network 
plasticity. Rather, we allow the model to discover on its own regularities in its network inputs. The “auditory” 
hidden neurons z^ on the left in Fig. [5]A received temporal spike patterns from the auditory input neurons 
x^ that were generated from spoken utterings of the digit 1 and 2 (which lasted between 320 and 520ms). 
Simultaneously we presented to the “visual” hidden neurons zy on the right for the same time period a 
(symbolic) visual representation of the same digit (randomly drawn from the MNIST database like in Fig. 
and[^. 

The emergent associations between the two populations za and zy of hidden neurons were tested by 
presenting auditory input only and observing the activity of the “visual” hidden neurons zy. Fig. shows 
the emergent activity of the neurons zy when only the auditory stimulus was presented (visual input neurons 
xy remained silent). The generative aspect of the network can be demonstrated by reconstructing for this case 
the visual stimulus from the activity of the “visual” hidden neurons zy. Fig. shows reconstructed visual 
stimuli from a single run where only the auditory stimuli 'x.a for digits 1 (left) and 2 (right) were presented to 
the network. Digit images were reconstructed by multiplying the synaptic efficacies of synapses from neurons 
in Xy to neurons in zy (which did not receive any input from xy in this experiment) with the instantaneous 
firing rates of the corresponding zy-neurons. 

In order to investigate the inherent compensation capability of synaptic sampling, we applied two lesions 
to the network within a learning session of more than 7 hours. In the first lesion all neurons (16 out of 40) that 


became tuned for digit 2 in the preceding learning (see Fig. and Supplement) were removed. The lesion 
significantly impaired the performance of the network in stimulus reconstruction, but it was able to recover 
from the lesion after about one hour of continuing network plasticity according to Eq. 0 (Fig. [^). The 
reconstruction performance of the network was measured here continuously through the capability of a linear 
readout neuron from the visual ensemble to classify the current auditory stimulus {1 or 2). 

In the second lesion all synaptic connections between hidden neurons that were present after recovery from 
the first lesion were removed and not allowed to regrow (2936 synapses in total). After about two hours of 
continuing synaptic sampling 294 new synaptic connections between hidden neurons emerged. These made 
it again possible to infer the auditory stimulus from the activity of the remaining 24 hidden neurons in the 
population zy (in the absence of any input from the population xy), at about 75% of the performance level 
before the second lesion (see bottom panel of Fig. HP)- 

In order to illustrate the ongoing network reconfiguration we track in Fig. the temporal evolution of 
a subset O' of network parameters (35 parameters 9i associated with the potential synaptic connections of 
the neuron marked in red in the middle of Fig. HP from or to other hidden neurons, excluding those that 
were removed at lesion 2 and not allowed to regrow). The first three PCA components of this 35-dimensional 
parameter vector are shown. The vector O' fluctuates first within one region of the parameter space while 
probing different solutions to the learning problem, e.g., high probability regions of the posterior distribution 
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(blue trace). Each lesions induced a fast switch to a different region (red,green), accompanied by a recovery of 
the visual stimulus reconstruction performance (see Fig. [^). 

Altogether this experiment showed that continuously ongoing synaptic sampling maintains stable network 
function also in a more complex network architecture. Another consequence of synaptic sampling was that the 
neural codes (assembly sequences) that emerged for the two digit classes within the hidden neurons and zy 
(see supplementary Fig. SIB) drifted over larger periods of time (also in the absence of lesions), similarly as 
observed for place cells in and for tuning curves of motor cortex neurons in j^. 


3 Discussion 


We have shown that stochasticity may provide an important function for network plasticity. It enables networks 
to sample parameters from some low-dimensional manifold in a high-dimensional parameter space that repre¬ 
sents attractive combinations of structural constraints and rules (such as sparse connectivity and heavy-tailed 
distributions of synaptic weights) and a good fit to empirical evidence (e.g., sensory inputs). We have developed 
a normative model for this new learning perspective, where the traditional gold standard of maximum likelihood 
optimization is replaced by theoretically optimal sampling from a posterior distribution of parameter settings, 
where regions of high probability provide a theoretically optimal model for the low-dimensional manifold from 
which parameter settings should be sampled. The postulate that networks should learn such posterior distribu¬ 
tions of parameters, rather than maximum likelihood values, had been proposed already for quite some while 
for artificial neural networks , since such organization of learning promises better generalization capability 
to new examples. The open problem how such posterior distributions could be learned by networks of neurons 


in the brain, in a way that is consistent with experimental data, has been highlighted in 10 as a key challenge 


for computational neuroscience. We have presented here such a model, whose primary innovation is to view 
experimentally found trial-to-trial variability and ongoing fluctuations of parameters such as spine volumes no 
longer as a nuisance, but as a functionally important component of the organization of network learning, since 
it enables sampling from a distribution of network configurations. The mathematical framework that we have 
presented provides a normative model for evaluating such empirically found stochastic dynamics of network 
parameters, and for relating specific properties of this “noise” to functional aspects of network learning. 

Reports of trial-to-trial variability and ongoing fluctuations of parameters related to synaptic weights are 
ubiquitous in experimental studies of synaptic plasticity and its molecular implementation, from fluctuations 
of proteins such as PSD-95 in the postsynaptic density that are thought to be related to synaptic strength, 
over intrinsic fluctuations in spine volumes and synaptic connections iilzll ^1^1^, to surprising shifts of 
neural codes on a larger time scale [6,48 . These fluctuations may have numerous causes, from noise in the 
external environment over noise and fluctuations of internal states in sensory neurons and brain networks, 
to noise in the pre- and postsynaptic molecular machinery that implements changes in synaptic efficacies on 


various time scales 20 . One might even hypothesize, that it would be very hard for this molecular machinery 


to implement synaptic weights that remain constant in the absence of learning, and deterministic rules for 
synaptic plasticity, because the half-life of many key proteins that are involved is relatively short, and receptors 
and other membrane-bound proteins are subject to Brownian motion. In this context the finding that neural 


codes shift over time (^[48] appears to be less surprising. In fact, our model predicts (see Supplement) that also 
stereotypical assembly sequences that emerge in our model through learning, similarly as in the experimental 
data of (4^, are subject to such shifts on a larger time scale. However it should be pointed out that our model 
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is agnostic with regard to the time scale on which these changes occur, since this time scale can be defined 
arbitrarily through the parameter b (learning rate) in Eq. (©• 

The model that we have presented makes no assumptions about the actual sources of noise. It only assumes 
that salient network parameters are subject to stochastic processes, that are qualitatively similar to those 
which have been studied and modeled in the context of Brownian motion of particles as random walk on the 
microscale. One can scale the influence of these stochastic forces in the model by a parameter T that regulates 
the “temperature” of the stochastic dynamics of network parameters 0. This parameter T regulates the tradeoff 
between trying out different regions (or modes) of the posterior distribution of 6 (exploration), and staying 
for longer time periods in a high probability region of the posterior (exploitation). We conjecture that this 
parameter T varies in the brain between different brain regions, and possibly also between different types of 
synaptic connections within a cortical column. For example, spine turnover is increased for large values of T, 
and network parameters 6 can move faster to a new peak in the posterior distribution, thereby supporting 
faster learning (and faster forgetting). Since spine turnover is reported to be higher in the hippocampus than 
in the cortex , such higher value of T could for example be more adequate for modeling network plasticity in 


the hippocampus. This model would then also support the hypothesis of 50 that memories are more transient 
in the hippocampus. In addition T is likely to be regulated on a larger time scale by developmental processes, 
and on a shorter time scale by neuromodulators and attentional control. The view that synaptic plasticity 
is stochastic had already been explored through simulation studies in [6,51 . Artificial neural networks were 
trained in [^ through supervised learning with high learning rates and high amounts of noise both on neuron 
outputs and synaptic weight changes. The authors explored the influence of various combinations of noise levels 
and learning rates on the success of learning, which can be understood as varying the temperature parameters T 
in the synaptic sampling framework. In order to measure this parameter T experimentally in a direct manner, 
one would have to apply repeatedly the same plasticity induction protocol to the same synapse, with a complete 
reset of the internal state of the synapse between trials, and measure the resulting trial-to-trial variability of 
changes of its synaptic efficacy. Since such complete reset of a synaptic state appears to be impossible at 
present, one can only try to approximate it by the variability that can be measured between different instances 
of the same type of synaptic connection. 

We have shown that the Fokker-Planck equation, a standard tool in physics for analyzing the temporal 
evolution of the spatial probability density function for particles under Brownian motion, can be used to 
create bridges between details of local stochastic plasticity processes on the microscale and the probability 
distribution of the vector 6 of all parameters on the network level. This theoretical result provides the basis 
for the new theory of network plasticity that we are proposing. In particular, this link allows us to derive rules 
for synaptic plasticity which enable the network to learn, and represent in a stochastic manner, a desirable 
posterior distribution of network parameters; in other words: to approximate Bayesian inference. 

We find that resulting rules for synaptic plasticity contain the familiar term for maximum likelihood learning. 
But another new term, apart from the Brownian-motion-like stochastic term, is the term ^\og ps{0i) that 
results from a prior distributions ps{(^i), which could actually be different for each biological parameter 6i 
and enforce structural requirements and preferences of networks of neurons in the brain. Some systematic 
dependencies of changes in synaptic weights (for the same pairing of pre- and postsynaptic activity) on their 
current values had already been reported in [M,|M-^. These can be modeled as impact of priors. Other 
potential functional benefits of priors (on emergent selectivity of neurons) have recently been demonstrated 
in [^ for a restricted Boltzmann machine. An interesting open question is whether the non-local learning 


20 







rules of their model can be approximated through biologically more realistic local plasticity rules, e.g. through 
synaptic sampling. We also have demonstrated in Fig. that suitable priors can model experimental data 
from 34 on the survival statistics of dendritic spines. Finally, we have demonstrated in Fig. and that 
suitable priors for network parameters 6i that model spine volumes endow a neural network with the capability 
to respond to changes in the input distribution and network perturbations with a network rewiring that 
maintains or restores the network function, while simultaneously observing structural constraints such as sparse 
connectivity. 

Our model underlines the importance of further experimental investigation of priors for network parameters. 
How are they implemented on a molecular level? What role does gene regulation have in their implementation? 
How does the history of a synapse affect its prior? In particular, can consolidation of a synaptic weight 9i 
be modeled in an adequate manner as a modification of its prior? This would be attractive from a functional 
perspective, because according to our model it both allows long-term storage of learned information and flexible 
network responses to subsequent perturbations. 

We have focused in the examples for our model on the plasticity of synaptic weights and synaptic con¬ 
nections. But the synaptic sampling framework can also be used for studying the plasticity of other synaptic 
parameters, e.g., parameters that control the short term dynamics of synapses, i.e., their individual mixture 


of short term facilitation and depression. The corresponding parameters U,D,F of the models from 56,57 


are known to depend in a systematic manner on the type of pre- and postsynaptic neuron [58] , indicative of a 
corresponding prior. However also a substantial variability within the same type of synaptic connections, had 
been found |58| . Hence it would be interesting to investigate functional properties and experimentally testable 
consequences of stochastic plasticity rules of type ([^ for U,D,F, and to compare the results with those of 
previously considered deterministic plasticity rules for U,D,F (see e.g., (^). 

We have demonstrated that our framework provides a new and principled way of modeling structural 
plasticity 12 [^ . The challenge to find a biologically plausible way of modeling structural plasticity as Bayesian 
inference has been highlighted by 10 . One nice feature of our approach is that network rewiring and changes of 
synaptic weights are modeled by a single rule, that can be directly related to functional aspects of the network 
via the resulting posterior distribution. We have shown in Fig. and that this rule produces a population of 
persistent synapses that remain stable over long periods of time, and another population of transient synaptic 
connections which disappear and reappear randomly, thereby supporting automatic adaptation of the network 
structure to changes in the distribution of external inputs (Fig. and network perturbation (Fig. [^. 

On a more general level we propose that a framework for network plasticity where network parameters are 
sampled continuously from a posterior distribution will automatically be less brittle than previously considered 
maximum likelihood learning frameworks. The latter require some intelligent supervisor who recognizes that 
the solution given by the current parameter vector is no longer useful, and induces the network to resume 
plasticity. In contrast, plasticity processes remain active all the time in our sampling-based framework. Hence 
network compensation for external or internal perturbations is automatic and inherent in the organization of 
network plasticity. 

The need to rethink observed parameter values and plasticity processes in biological networks of neurons in 
a way which takes into account their astounding variability and compensation capabilities has been emphasized 
by Eve Harder (see e.g. m 47||60] ) and others. This article has introduced a new conceptual and mathematical 
framework for network plasticity that promises to provide a foundation for such rethinking of network plasticity. 
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4 Methods 


4.1 Details to Learning a posterior distribution through stochastic synaptic plasticity 

Here we prove that p*{0) = p{6\ x) is the unique stationary distribution of the parameter dynamics Q that 
operate on the network parameters 0 = {9i,..., 9m)- Convergence to this stationary distribution then follows 
for strictly positive p*{6)- In fact, we prove here a more general result for parameter dynamics given by 


d9i = [b{9i)-^\ogps{0) + h{9i)^\ogp^f{:K\e)+Tb'{9i)] dt + \j2Tb{9i) dWi 


( 12 ) 


(for i = 1,..., M). This dynamics includes a temperature parameter T and a sampling-speed factor b{9i) that 
can in general depend on the current value of the parameter 9i. The temperature parameter T can be used to 
scale the diffusion term (i.e., the noise). The sampling-speed factor controls the speed of sampling, i.e., how 
fast the parameter space is explored. It can be made dependent on the individual parameter value without 
changing the stationary distribution. For example, the sampling speed of a synaptic weight can be slowed down 
if it reaches very high or very low values. Note that the dynamics Q is a special case of the dynamics (12) 
with unit temperature T = 1 and constant sampling speed b[9i) = b. We show that the stochastic dynamics 


(12) leaves the distribution 


invariant, where Z \s a. normalizing constant Z = J q*{6)d0 and 

q*{9) = p{0 I x) T . 


(13) 


(14) 


Note that the stationary distribution p*{9) is shaped by the temperature parameter T, in the sense that p*{9) 
is a flattened version of the posterior for high temperature. The result is formalized in the following theorem. 


which is proven in detail in Supplement 


Theorem 1. Let p(x, 9) be a strietly positive, continuous probability distribution over continuous or discrete 
states and eontinuous parameters 9 = {9i,... ,9m), twice continuously differentiable with respeet to 9. Let 
b{9) be a strictly positive, twiee continuously differentiable function. Then the set of stochastic differential 


equations (12) leaves the distribution p* (9) invariant. Furthermore, p*{9) is the unique stationary distribution 


of the sampling dynamics. 


4.1.1 Online approximation 

We show here that the rule ([^ is a reasonable approximation to the batch-rule ([^. According to the dynamics 
(d), synaptic plasticity rules that implement synaptic sampling have to compute the log likelihood derivative 
logpj\f{yi\9). We assume that every Tx time units a different input is presented to the network. For 
simplicity, assume that x^,...,x^ are visited in a fixed regular order. Under the assumption that input 
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patterns are drawn independently, the likelihood of the generative model factorizes 


N 


Pm{^, I^) = 


( 15 ) 


The derivative of the log likelihood is then given by 


n=l 


N 


^ logpAr(x|0) = ^-^ logpj^{x^\e) . 


da 


n=l 


(16) 


Using Eq. (16) in the dynamics (12), one obtains 

dOi = (b{e^)^iogps{e) + b{6i)Y,^^ogpM{x'^\e) + Ty{di)] dt + dWi. 


n=l 


(17) 


Hence, the parameter dynamics depends at any time on all network inputs and network responses. 

This “batch” dynamics does not map readily onto a network implementation becanse the weight npdate 
requires at any time knowledge of all inpnts x'^. We provide here an online approximation for small sampling 
speeds. To obtain an online learning rule, we consider the parameter dynamics 

d9^ = (b{9^)^ logpsie)+Nb{9i)^logpM{x^\e) + Tb'i9i)^ dt +^/2f^dW^. (18) 

As in the batch learning setting, we assume that each input x'^ is presented for a time interval of r^. Integrating 


the parameter changes (18) over one full presentation of the data x, i.e., starting from t = 0 with some initial 


parameter values 0(0) up to time t = Ntx, we obtain for slow sampling speeds {NTxb{9i) <C 1) 
9i{NTx) - 9i{0) ^ Ntx \ogps{e) + b{9i) 4: iogpM{x^\e) +Tb'{9i) 


n=l 


+ ^y2Tb{9i) (Wf^" - W°). 


(19) 


This is also what one obtains when integrating Eq. 0 for Ntx time units (for slow b{9i)). Hence, for slow 
enough b(9i), Eq. (18) is a good approximation of optimal weight sampling. The npdate rnle ([^ follows from 


(18) for T = 1 and b{9i) = b. 


4.1.2 Discrete time approximation 

Here we provide the derivation for the approximate discrete time learning rnle 0. For a discrete time parameter 
update at time t with discrete time step At during which x^ is presented, a corresponding rnle can be obtained 


by short integration of the continuous time rule (18) over the time interval from t to t + At: 


A9i = At {h{9i)^^ \ogps{e) + Nh{9i)^\ogpM{x^\e) + Tb'{9i)^ + ^2^^ (Wf - W*) (20) 

= At(h{9i)^\ogps{e) + Nh{9i)^\ogpM{x^\e) + Tb'{9,)\ + ^2TAtb{9i)vl ( 21 ) 
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where uj denotes Gaussian noise uj ~ Normal(0, 1). The update rule Q is obtained by choosing a constant 
b{9) = b, T = 1, and defining rj = Atb. 


4.1.3 Synaptic sampling with hidden states 

When there is a direct relationship between network parameters 6 and the distribution over input patterns jc”, 
the parameter dynamics can directly be derived from the derivative of the data log likelihood and the derivative 
of the parameter prior. Typically however, generative models for brain computation assume that the network 
response z'^ to input pattern represents in some manner the value of hidden variables that explain current 
input pattern. In the presence of hidden variables, maximum likelihood learning cannot be applied directly, 
since the state of the hidden variables is not known from the observed data. The expectation maximization 
algorithm can be used to overcome this problem. We adopt this approach here. In the online setting, 
when pattern is applied to the network, it responds with network state z'^ according to pj\f{z\x^, 0), where 
the current network parameters are used in this inference process. The parameters are updated in parallel 
according to the dynamics 


dOi = {b{ei)-^iogps{e) + Nbiei)-^iogpM{x^,z^\e) + Tb'{ei)) dt +^/2Tb(^dWi. (22) 


Note that in comparison with the dynamics (18), the likelihood term now also contains the current network 
response 2 ”. It can be shown that this dynamics leaves the stationary distribution 


P*{9) = —p(6'|x,z)t, 


(23) 


invariant, where Z is again a normalizing constant (the dynamics (22) is again the online-approximation). 
Hence, in this setup, the network samples concurrently from circuit states (given 6) and network parameters 
(given the network state z”), which can be seen as a sampling-based version of online expectation maximization. 


4.2 Details to Improving the generalization capability of a neural network through 
synaptic sampling (Fig. 2) 

For learning the distribution over different writings of digit 1 with different priors in Fig. a restricted 
Boltzmann machine (RBM) with 748 visible and 9 hidden neurons was used. A detailed definition of the RBM 


model and additional details to the simulations are given in Supplement 


4.2.1 Network inputs 

Handwritten digit images were taken from the MNIST dataset j^. In MNIST, each instance of a handwritten 
digit is represented by a 784-dimensional vector x'^. Each entry is given by the gray-scale value of a pixel in the 
28 X 28 pixel image of the handwritten digit. The pixel values were scaled to the interval [0,1]. In the RBM, 
each pixel was represented by a single visible neuron. When an input was presented to the network, the output 
of a visible neuron was set to 1 with probability as given by the scaled gray-scale value of the corresponding 
pixel. 
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4.2.2 Learning procedure 


In each parameter update step the contrastive divergence algorithm of [29] was used to estimate the likelihood 
gradients. Therefore, each update step consisted of a “wake” phase, a “reconstruction” phase, and the update of 
the parameters. The “wake” samples were generated by setting the outputs of the visible neurons to the values 
of a randomly chosen digit from the training set and drawing the outputs z'^ of all hidden layer neurons 
for the given visible output. The “reconstruction” activities x'j and zf were generated by starting from this 
state of the hidden neurons and then drawing outputs of all visible neurons. After that, the hidden neurons 
were again updated and so on. In this way we performed five cycles of alternating visible and hidden neuron 
updates. The outputs of the network neurons after the fifth cycle were taken as the resulting “reconstruction” 


samples x” and zf and used for the parameter updates (24)-(26) given below. This update of parameters 
concluded one update step. 

Log likelihood derivatives for the biases 6^'^ of hidden neurons are approximated in the contrastive diver¬ 
gence algorithm 29 as log pj\f{x'^\9) « z^ — z^ (the derivatives for visible biases are analogous). Using 


Eq. the synaptic sampling update rules for the biases are thus given by 

Abf^=rjNizf-z^) + 

A6™ = r]N [x] - x]) + 


(24) 

(25) 


Note that the parameter prior does not show up in these equations since no priors were used for the biases 
in our experiments. Contrastive divergence approximates the log likelihood derivatives for the weights Wij as 
logpj\f{x^\6) zfx'j — z^x^. This leads to the synaptic sampling rule 

Awij = r] (^-7^ logp 5 (w) -h N {z^x] - 4’"^”)) + ■ (26) 

In the simulations, we used this rule with rj = 10 “^ and N = 100 . Learning started from random initial 
parameters drawn from a Gaussian distribution with standard deviation 0.25 and means at 0 and —1 for 
weights Wij and biases (6^*'^, by^), respectively. 

To compare learning with and without parameter priors, we performed simulations with an uninformative 
(i.e., uniform) prior on weights (Fig. j^,D), which was implemented by setting g;^logp 5 (w) to zero. In 
simulations with a parameter prior (Fig. [^,F), we used a local prior for each weight in order to obtain local 
plasticity rules. In other words, the prior P 5 (w) was assumed to factorize into priors for individual weights 
P 5 (w) = Y\i jPsiwij). For each individual weight prior, we used a bimodal distribution implemented by a 
mixture of two Gaussians 


Ps{wij) = 0.5 Normal (rcjj | pi,(Ti) + 0.5 Normal (tcjj | P2,cr2) , (27) 

with means pi = 1.0, p 2 = 0.0, and standard deviations ui = cj 2 = 0.15. 

4.3 Details to Fast adaptation to changing input statistics (Fig. 

4.3.1 Spike-based Winner-Take-All network model 

Neuron model: Network neurons were modeled as stochastic spike response neurons with a firing rate that 
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depends exponentially on the membrane voltage [M, 63 . The membrane potential Uk (t) of neuron k is given 
by 

= '^Wki Xi{t) + Pkit) , (28) 


where Xi{t) denotes the (unweighted) input from input neuron i, Wki denotes the efficacy of the synapse from 
input neuron f, and /3fc(f) denotes a homeostatic adaptation current (see below). The input Xi{t) models the 
influence of additive excitatory postsynaptic potentials (EPSPs) on the membrane potential of the neuron. Let 
,... denote the spike times of input neuron i. Then, Xi{t) is given by 


Xi{t) = '^e{t-tP), 
/ 


(29) 


where e is the response kernel for synaptic input, i.e., the shape of the EPSP, that had a double-exponential 
form in our simulations: 

e(s) = 0(s) — e , (30) 

with the rise-time constant = 2ms, the fall-time constant Tf = 20ms. 0(-) denotes the Heaviside step 
function. The instantaneous firing rate Pk{t) of network neuron k depends exponentially on the membrane 
potential and is subject to divisive lateral inhibition Iiat(i) (described below): 

Pk{t) = exp(ufc(t)) , (31) 

where pnet = lOOHz scales the firing rate of neurons. Such exponential relationship between the membrane 
potential and the firing rate has been proposed as a good approximation to the firing properties of cortical 
pyramidal neurons [62] . Spike trains were then drawn from independent Poisson processes with instantaneous 
rate pk{t) for each neuron. We denote the resulting spike train of the k^^ neuron by Sk{t). 

Homeostatic adaptation current: Each output spike caused a slow depressing current, giving rise to the 
adaptation current /3k{t). This implements a slow homeostatic mechanism that regulates the output rate of 
individual neurons (see |64j for details). It was implemented as 

Pk{t) = (32) 

/ 


f f) 

where t): ' denotes the /-th spike of neuron k and k is an adaptation kernel that was modeled as a double 


exponential (Eq. (30)) with time constants = 12s and Tf = 30s. The scaling parameter 7 was set to 7 = — 8 . 


Lateral inhibition: Divisive inhibition |42j between the K neurons in the WTA network was implemented in 


an idealized form 37 


K 

7iat(i) = y~]exp(uf(t)). 

i=i 


(33) 


This form of lateral inhibition, that assumes an idealized access to neuronal membrane potentials, has been 
shown to implement a well-defined generative network model |37j, see below. 
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4.3.2 Synaptic sampling in spike-based Winner-Take-All networks as stochastic STDP 

It has been shown in that the WTA-network defined above implicitly defines a generative model that 
is a mixture of Poissonian distributions. In this generative model, inputs a;"' are assumed to be generated 
in dependence on the value of a hidden multinomial random variable /i” that can take on K possible values 
Each neuron k in the WTA circuit corresponds to one value k of this hidden variable. In the 
generative model, for a given value of = k, the value of an input xf is then distributed according to a 
Poisson distribution with a mean that is determined by the synaptic weight Wki from input neuron i to network 
neuron k: 


= k,w) = PoiSSON(x”|Q;e""=‘), 


(34) 


with a scaling parameter a > 0. In other words, the synaptic weight Wki encodes (in log-space) the firing 
rate of input neuron i, given that the hidden cause is k. For a given hidden cause, inputs are assumed to be 
independent, hence one obtains the probability of an input vector for a given hidden cause as 


= k,w) = PoiSSON(a:"|ae"’'“*). 


(35) 


The network implements inference in this generative model, i.e., for a given input *”■, the firing rate of network 
neuron Zk is proportional to the posterior probability p{h'^ = k\x'^,w) of the corresponding hidden cause. An 
online maximum likelihood learning rule for this generative model was derived in [40| . It changes synaptic 
weights according to 


d 

dwki 


logPAr(®” I w) ^ Skit) ixi{t) -ae^^*) , 


(36) 


where Skit) denotes the spike train of the postsynaptic neuron and Xiit) denotes the weight-normalized value 
of the sum of EPSPs from presynaptic neuron i at time t (i.e., the summed EPSPs that would arise for weight 


Wki = 1; see Section 4.3.1 for details). To define the synaptic sampling learning rule completely, we also 
need to define the parameter prior. In our experiments, we used a simple Gaussian prior on each parameter 
Psi^) = Ofc i NORMAL(6lfcj|/r, cj^) with p = 0.5 and a = 1. The derivative of the log-prior is given by 


d 1 

log PS iO) = -^i^ki-p) 
OOki CF^ 


(37) 


Inserting Eqs. (36) and (37) into the general form (10), we find that the synaptic sampling rule is given by 

dOki = b (^^iOki-p) + NwkiSkit) ixiit) - ae^’^')'^ dt +V^d'Wki, (38) 

which corresponds to rule (0 with double indices ki replaced by single parameter indexing i to simplify 
notation. 
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4.3.3 Simulation details for spiking network simulations (Figs. 3,4, and 5 


Computer simulations of spiking nenral networks were based on adapted event-based simnlation software from 
[41] . In all spiking neural network simulations, synaptic weights were updated according to the rule © with 
parameters N = 100, a = e“^, and b = 10“^, except for panel where 6 = 10 ® was used as a control. 
In the simulations, we directly implemented the time-continuous evolution of the network parameters in an 
event-based update scheme. Before learning, initial synaptic parameters were independently drawn from the 
prior distribntion ps{G)- 

For the mapping Q from synaptic parameters 9ki to synaptic efficacies Wki, we used as offset 9o = 3. This 
results in synaptic weights that shrink to small valnes (< 0.05) when synaptic parameters are below zero. In the 
simulation, we clipped the synaptic weights to zero for negative synaptic parameters 9 to account for retracted 


synapses. More precisely, the actual weights Wki used for the computation of the membrane potential (28) were 


given by w^i = max{0,t(;fcj — exp(— 6 * 0 )} . To avoid numerical problems, we clipped the synaptic parameters 
at —5 and the maximum amplitude of instantaneous parameter changes to 56. 


4.3.4 Network inputs for Figs, and 4 


Handwritten digit images were taken from the MNIST dataset | 6 l| . Each pixel was represented by a single 
afferent neuron. Gray scale values where scaled to 0-50Hz Poisson input rate and IHz input noise was added 
on top. These Poisson rates were kept fixed for each example input digit for a duration of 200ms. After each 
digit presentation, a 50ms window of IHz Poisson noise on all input channels was presented before the next 
digit was shown. 


4.4 Details to Inherent compensation capabilities of networks with synaptic sampling 
(Fig. 5) 

Here we provide details to the network model and spiking inputs for the recurrent WTA circuits. Additional 
details to the data analysis and performance evaluation are provided in Supplement! 


4.4.1 Network model 

In Fig.j^two recurrently connected ensembles, each consisting of four WTA circuits, were used. The parameters 


of neuron and synapse dynamics were as described in Section 4.3.1 All synapses, lateral and feedforward, were 
subject to the same learning rule ©• Lateral connections within and between the WTA Circuit neurons were 
unconstrained (allowing potentially all-to-all connectivity). Connections from input neurons were constraint 
as shown in Fig. The lateral synapses were treated in the same way as synapses from input neurons but had 
a synaptic delay of 5ms. 


4.4.2 Network inputs 

The spoken digit presentations were given by reconstructed cochleagrams of speech samples of isolated spoken 


digits from the TI 46 dataset (also used in 43 ^). Each of the 77 channels of the cochleagrams was represented 


by 10 afferent neurons, giving a total of 770. Cochleagrams were normalized between 0 and 80Hz and used to 
draw individual Poisson spike trains for each afferent neuron. In addition IHz Poisson noise was added on top. 
We used 10 different utterances of digits 1 and ^ of a single speaker. We selected 7 utterances for training and 
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3 for testing. For training, one randomly selected utterance was presented together with a randomly chosen 
instance of the corresponding handwritten digit. The spike patterns for the written digits were generated as in 


Section 2.6 and had the same duration as the spoken digits. Each digit presentation was padded with 25ms, 
IHz Poisson noise before and after the digit pattern. 

For test trials in which only the auditory stimulus was presented, the activity of the visual input neurons 
was set to IHz throughout the whole pattern presentation. The learning rate b was set to zero during these 
trials. The PETH plots were computed over 100 trial responses of the network to the same stimulus class (e.g. 
presentation of digit 1). Spike patterns for input stimuli were randomly drawn in each trial for the given rates. 
Spike trains were then filtered with a Gaussian filter with a = 50ms and summed in a time-discrete matrix 
with 10ms bin length. Maximum firing times were assigned to the time bin with the highest PETH amplitude 
for each neuron. 


5 Supplement 

The supplement is available on the author’s web page http://www.igi.tugraz.at/kappel/. 
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